Take-home Exercise 3

This take home exercise aims to explain factors affecting resale prices of public housing in Singapore by building hedonic pricing models using appropriate GWR methods.

Genice Goh true
11-05-2021

1.0 Introduction

Housing price is affected by a variety of factors, including global factors such as the general economy of a country or property specific factors such as structural variables related to the property and locational variables related to the neighbourhood.

Hedonic pricing model is used to examine the effect of housing factors as on property price. However, this method fails to take into consideration that spatial autocorrelation and spatial heterogeneity exist in housing transaction data, which could lead to biased results (Anselin 1998). In view of this limitation, Geographical Weighted Regression (GWR) was introduced to calibrate hedonic price models for housing.

As such, we will be building hedonic pricing models to explain factors affecting the resale prices of public housing in Singapore using appropriate GWR methods.

1.1 The Data

The data sets used for this analysis include:

1.2 Installing and Loading Packages

The packages used for this analysis include:

packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'httr', 'units', 'matrixStats')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.0 Geospatial Data

2.1 Importing Geospatial Data

st_read() of sf package is used to import the geospatial data, which is in shapefile format.

mrt_sf <- st_read(dsn="data/geospatial",
               layer="MRTLRTStnPtt")
Reading layer `MRTLRTStnPtt' from data source 
  `C:\Users\user\Desktop\SMU\Y4S1\IS415\geniceee\IS415_blog\_posts\2021-11-05-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 185 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6138.311 ymin: 27555.06 xmax: 45254.86 ymax: 47854.2
Projected CRS: SVY21
bus_sf <- st_read(dsn="data/geospatial",
               layer="BusStop")
Reading layer `BusStop' from data source 
  `C:\Users\user\Desktop\SMU\Y4S1\IS415\geniceee\IS415_blog\_posts\2021-11-05-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5156 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 4427.938 ymin: 26482.1 xmax: 48282.5 ymax: 52983.82
Projected CRS: SVY21
mpsz_sf <- st_read(dsn="data/geospatial",
               layer="MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\Users\user\Desktop\SMU\Y4S1\IS415\geniceee\IS415_blog\_posts\2021-11-05-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21

From the output message, we can see that:

2.2 Data Preprocessing

Before we visualise the geospatial data, we will need to conduct data preprocessing to ensure that there are no invalid geometries and missing values.

2.2.1 Invalid Geometries

length(which(st_is_valid(mrt_sf) == FALSE))
[1] 0
length(which(st_is_valid(bus_sf) == FALSE))
[1] 0
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 9

There are no invalid geometries in both mrt_sfand bus_sf data frames while the mpsz_sf data frame contains 9 invalid geometries. We will now proceed to remove the invalid geometries in the mpsz_sf data frame.

mpsz_sf <- st_make_valid(mpsz_sf)

# Check again for invalid geometries
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 0

2.2.2 Missing Values

mrt_sf[rowSums(is.na(mrt_sf))!=0,]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID STN_NAME STN_NO   geometry
<0 rows> (or 0-length row.names)
bus_sf[rowSums(is.na(bus_sf))!=0,]
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22616.75 ymin: 47793.68 xmax: 22616.75 ymax: 47793.68
Projected CRS: SVY21
    BUS_STOP_N BUS_ROOF_N LOC_DESC                  geometry
264      47201        UNK     <NA> POINT (22616.75 47793.68)
mpsz_sf[rowSums(is.na(mpsz_sf))!=0,]
Simple feature collection with 0 features and 15 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
 [1] OBJECTID   SUBZONE_NO SUBZONE_N  SUBZONE_C  CA_IND     PLN_AREA_N
 [7] PLN_AREA_C REGION_N   REGION_C   INC_CRC    FMEL_UPD_D X_ADDR    
[13] Y_ADDR     SHAPE_Leng SHAPE_Area geometry  
<0 rows> (or 0-length row.names)

We can see that there is 1 observation with missing values in the bus_sf data frame. We will remove it because another bus stop of the same BUS_ROOF_N is found after further data exploration.

bus_sf <- bus_sf[!rowSums(is.na(bus_sf))!=0,]

# Check again for missing values
bus_sf[rowSums(is.na(bus_sf))!=0,]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC   geometry  
<0 rows> (or 0-length row.names)

2.2.3 Duplicated Values

mrt_sf[duplicated(mrt_sf$STN_NAME),]
Simple feature collection with 19 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22949.03 ymin: 28735.78 xmax: 42362.71 ymax: 46418.56
Projected CRS: SVY21
First 10 features:
    OBJECTID                STN_NAME STN_NO                  geometry
33        33  ANG MO KIO MRT STATION   NS16 POINT (29807.27 39105.77)
97       105  MACPHERSON MRT STATION   DT26  POINT (34235.8 34256.43)
103      111       BUGIS MRT STATION   EW12 POINT (30491.56 31424.35)
114      122        EXPO MRT STATION   DT35 POINT (42362.71 35285.67)
116      124 BUONA VISTA MRT STATION   CC22 POINT (23251.76 32090.77)
121      129  MARINA BAY MRT STATION    CE2 POINT (30423.43 28735.78)
124      132   CHINATOWN MRT STATION   DT19 POINT (29238.97 29686.54)
134      142    TAMPINES MRT STATION   DT32 POINT (40213.03 37463.37)
140      148   SERANGOON MRT STATION   NE12 POINT (32480.07 36869.39)
144      152  PAYA LEBAR MRT STATION    CC9 POINT (34550.58 33300.34)
bus_sf[duplicated(bus_sf$BUS_STOP_N),]
Simple feature collection with 13 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 13488.02 ymin: 29969.22 xmax: 44144.57 ymax: 47806.75
Projected CRS: SVY21
First 10 features:
     BUS_STOP_N BUS_ROOF_N             LOC_DESC
350       58031        UNK      OPP CANBERRA DR
1569      51071        B21 MACRITCHIE RESERVOIR
2208      82221        B01               Blk 3A
2215      97079        B14  OPP ST. JOHN'S CRES
2392      62251        B03         BEF BLK 471B
2462      22501        B02             BLK 662A
2736      16119        NIL        TELETECH PARK
2976      58229       B01A          BLK 358A CP
3442      67421        NIL CHENG LIM STN EXIT B
3521      68091        B08         AFT BAKER ST
                      geometry
350   POINT (27089.69 47570.9)
1569 POINT (28305.37 36036.67)
2208 POINT (35308.74 33335.17)
2215 POINT (44144.57 38980.25)
2392 POINT (35500.36 39943.34)
2462 POINT (13488.02 35537.88)
2736 POINT (22318.89 29969.22)
2976 POINT (26094.86 47806.75)
3442 POINT (34741.77 42004.21)
3521 POINT (32038.84 43298.68)
mpsz_sf[duplicated(mpsz_sf$SUBZONE_C),]
Simple feature collection with 0 features and 15 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
 [1] OBJECTID   SUBZONE_NO SUBZONE_N  SUBZONE_C  CA_IND     PLN_AREA_N
 [7] PLN_AREA_C REGION_N   REGION_C   INC_CRC    FMEL_UPD_D X_ADDR    
[13] Y_ADDR     SHAPE_Leng SHAPE_Area geometry  
<0 rows> (or 0-length row.names)

We can observe that there are 19 and 13 duplicated observations in the mrt_sf and bus_sf data frames respectively, while there are none in the mpsz_sf data frame. We will proceed to remove the duplicated observations.

mrt_sf <- mrt_sf[!duplicated(mrt_sf$STN_NAME),]
bus_sf <- bus_sf[!duplicated(bus_sf$BUS_STOP_N),]

# Check again for duplicates
mrt_sf[duplicated(mrt_sf$STN_NAME),]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID STN_NAME STN_NO   geometry
<0 rows> (or 0-length row.names)
bus_sf[duplicated(bus_sf$BUS_STOP_N),]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC   geometry  
<0 rows> (or 0-length row.names)

2.3 Verify Coordinate Reference System

We will first need to retrieve the coordinate reference system for verification. st_crs() of sf package is used to do this.

st_crs(mrt_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(bus_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(mpsz_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]

From the output messages, we can observe that the EPSG code for the data frames is currently 9001. This is wrong because the EPSG code of projection coordinate system SVY21 is supposed to be 3414, instead of 9001.

st_set_crs() of sf package is therefore used to assign the correct EPSG code for the data frames.

mrt_sf <- st_set_crs(mrt_sf, 3414)
bus_sf <- st_set_crs(bus_sf, 3414)
mpsz_sf <- st_set_crs(mpsz_sf, 3414)

We will now use st_crs() of sf package to retrieve the coordinate reference system again.

st_crs(mrt_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(bus_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(mpsz_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

The EPSG code is now correctly assigned for all sf data frames!!

2.4 Visualising Geospatial Data

It is useful to plot a map to visualise the geospatial data, so that we can easily get a preliminary look at the spatial patterns.

tmap_mode('view')

tm_shape(mrt_sf) +
  tm_dots(col="red") +
tm_shape(bus_sf) + 
  tm_dots(col="blue")

If we look closely, we can see that there are 5 bus stops outside of Singapore’s boundary (46211, 46219, 46239, 46609, 47701). As we are able to travel to and fro Johor Bahru with specific buses, there are designated bus stops located at Johor Bahru.

As such, we should remove these bus stops before proceeding with our analysis.

2.5 Further Data Preprocessing

In this section, we will proceed to remove the bus stops identified earlier.

2.5.1 Inspect the specific bus stops

omit_bus <- list(46211, 46219, 46239, 46609, 47701)
bus_sf[bus_sf$BUS_STOP_N %in% omit_bus,]
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 17969.14 ymin: 49173.42 xmax: 20723.24 ymax: 52983.82
Projected CRS: SVY21 / Singapore TM
     BUS_STOP_N BUS_ROOF_N            LOC_DESC
765       47701        NIL          JB SENTRAL
1532      46239         NA          LARKIN TER
2257      46609         NA     KOTARAYA II TER
2269      46211        NIL JOHOR BAHRU CHECKPT
4346      46219        NIL JOHOR BAHRU CHECKPT
                      geometry
765  POINT (20370.54 49173.42)
1532 POINT (17969.14 52983.82)
2257  POINT (20025.46 49261.6)
2269 POINT (20623.18 49199.99)
4346 POINT (20723.24 49200.05)

We can observe that the location descriptions of these bus stops indeed indicate that they are situated in Johor Bahru. We will therefore need to remove them.

bus_sf <- bus_sf[!bus_sf$BUS_STOP_N %in% omit_bus,]

# Check again if we have removed successfully
bus_sf[bus_sf$BUS_STOP_N %in% omit_bus,]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC   geometry  
<0 rows> (or 0-length row.names)

3.0 Aspatial Data

3.1 Obtaining Aspatial Data

We will be making use of the package onemapsgapi to query the required data sets from OneMap API. We will then save these data sets in CSV format.

library(onemapsgapi)

token <- "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjc5NTUsInVzZXJfaWQiOjc5NTUsImVtYWlsIjoiZ2VuaWNlLmdvaC4yMDE4QHNjaXMuc211LmVkdS5zZyIsImZvcmV2ZXIiOmZhbHNlLCJpc3MiOiJodHRwOlwvXC9vbTIuZGZlLm9uZW1hcC5zZ1wvYXBpXC92MlwvdXNlclwvc2Vzc2lvbiIsImlhdCI6MTYzNTU3OTEzMywiZXhwIjoxNjM2MDExMTMzLCJuYmYiOjE2MzU1NzkxMzMsImp0aSI6ImIzZmEyOTg1MDQyZTFiOGY0MWYzMWJlNzFmYzdlZTEwIn0.z2gh5tK3ezz0lUXLoPilPVOfN0MaXJm4N4RbAtVIQFI"

data <- list("eldercare", "hawkercentre", "relaxsg", "supermarkets", "kindergartens", "childcare")

for (d in data) {
  df <- get_theme(token, d)
  write_csv(df, str_replace("data/aspatial/df.csv", "df", d))
}

3.2 Importing Aspatial Data

read_csv() of readr package is used to import the CSV files. The outputs are tibble data frames.

eldercare = read_csv("data/aspatial/eldercare.csv")
hawker = read_csv("data/aspatial/hawkercentre.csv")
park = read_csv("data/aspatial/relaxsg.csv")
supermarket = read_csv("data/aspatial/supermarkets.csv")
kindergarten = read_csv("data/aspatial/kindergartens.csv")
childcare = read_csv("data/aspatial/childcare.csv")
school = read_csv("data/aspatial/schools.csv")
mall = read_csv("data/aspatial/shopping-mall.csv")
flat_resale = read_csv("data/aspatial/resale-flat-prices.csv")

It is also important to understand the data that we are working with. glimpse() of dplyr package is therefore used to perform exploratory data analysis.

glimpse(eldercare)
glimpse(hawker)
glimpse(park)
glimpse(supermarket)
glimpse(kindergarten)
glimpse(childcare)
glimpse(school)
glimpse(mall)
glimpse(flat_resale)

3.3 Data Preparation

We will proceed to prepare the data such that they can be used later for the preparation of independent variables. The steps taken include:

# remove redundant columns for eldercare dataset
eldercare <- eldercare %>%
  select(c(1, 4:5))

# remove redundant columns for hawker dataset
hawker <- hawker %>%
  select(c(1, 14:15))

# remove redundant columns for park dataset
park <- park %>%
  select(c(1, 8:9))

# remove redundant columns for supermarket dataset
supermarket <- supermarket %>%
  select(c(1, 7:8))

# remove redundant columns for kindergarten dataset
kindergarten <- kindergarten %>%
  select(c(1, 5:6))

# remove redundant columns for childcare dataset
childcare <- childcare %>%
  select(c(1, 5:6))

# filter out primary schools; remove redundant columns 
prischool <- school %>%
  filter(mainlevel_code == "PRIMARY") %>% 
  select(c(1, 3:4, 25,27))

# filter out 4-room flats with transaction period from 01-Jan-19 to 30-Sep-20
flat_resale <- flat_resale %>% 
  filter(flat_type == "4 ROOM") %>%
  filter(month >= "2019-01" & month < "2020-10")

3.4 Data Preprocessing

We will need to conduct data preprocessing to ensure that there are no NA values in all data frames.

3.4.1 NA Values

eldercare[rowSums(is.na(eldercare))!=0,]
hawker[rowSums(is.na(hawker))!=0,]
park[rowSums(is.na(park))!=0,]
supermarket[rowSums(is.na(supermarket))!=0,]
kindergarten[rowSums(is.na(kindergarten))!=0,]
childcare[rowSums(is.na(childcare))!=0,]
prischool[rowSums(is.na(prischool))!=0,]
mall[rowSums(is.na(mall))!=0,]
flat_resale[rowSums(is.na(flat_resale))!=0,]

There are no NA values in all the data frames.

3.5 Geocoding for Aspatial Data

After exploratory data analysis performed earlier, it is found that prischool and flat_resale data frames do not have Lat and Lng columns. These columns are required for the preparation of the independent variables later. We therefore need to use OneMap API to geocode the coordinates columns.

3.5.1 Create Address Column

We first need to create an address column for the flat_resale data frame since the address data is currently split into 2 columns block and street name respectively.

unite() of dpylr package is used to concatenate the 2 columns.

flat_resale <- flat_resale %>%
  unite("address", block:street_name, sep= " ")

3.5.2 Rename ST. to SAINT

We alo observe that some addresses in the the flat_resale data frame start with “ST.”. It is found later that this will pose problems during geocoding. As a result, we will need to rename “ST.” to its full version “SAINT”.

flat_resale$address <- gsub("ST\\.", "SAINT", flat_resale$address)

3.5.3 Geocoding Function

In this function, the input variable address is passed in as a search value in the query to the API. The output is then converted to a data frame, where we only choose to retain the Latitude and Longitude columns. These 2 columns are returned as the output.

geocode <- function(address) {

  url <- "https://developers.onemap.sg/commonapi/search"

  query <- list("searchVal" = address, "returnGeom" = "Y",
                "getAddrDetails" = "N", "pageNum" = "1")

  res <- GET(url, query = query, verbose())

  output <- content(res) %>%
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)

  return(output)
}

3.5.4 Geocoding for Primary Schools

We will now loop through each row of the prischool data frame and implement the geocode function. The output is saved as 2 new columns Lat and Lng.

prischool$Lat <- 0
prischool$Lng <- 0

for (i in 1:nrow(prischool)) {
  output <- geocode(prischool$postal_code[i])
  
  prischool$Lat[i] <- output$results.LATITUDE
  prischool$Lng[i] <- output$results.LONGITUDE
}

3.5.5 Geocoding for Resale Flat Prices

We will now loop through each row of the flat_resale data frame and implement the geocode function. The output is saved as 2 new columns Lat and Lng.

flat_resale$Lat <- 0
flat_resale$Lng <- 0

for (i in 1:nrow(flat_resale)) {
  output <- geocode(flat_resale$address[i])
  
  flat_resale$Lat[i] <- output$results.LATITUDE
  flat_resale$Lng[i] <- output$results.LONGITUDE
}

3.6 Structural Factor Preparation

3.6.1 Floor Level

We need to conduct dummy coding on the storey_range variable for us to use it in model later. We will first look at the individual storey-range values for us to get a rough idea on the number of columns that will be produced.

unique(flat_resale$storey_range)

We observed that there are 17 storey range categories.

To conduct dummy coding, we will be using pivot_wider() of dplyr package to create duplicate variables representing every store-range. If the obeservation belongs to the storey-range, the value will be 1. The value will be 0 otherwise.

flat_resale <- flat_resale %>%
  pivot_wider(names_from = storey_range, values_from = storey_range, 
              values_fn = list(storey_range = ~1), values_fill = 0) 

3.6.2 Remaining Lease

We need to convert the remaining_lease column from string to numeric format to use in the model later. We will first split the string within the column and take the value(s) of the year and month. We will then calculate the remaining lease in years and replace the original value.

str_list <- str_split(flat_resale$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      flat_resale$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    flat_resale$remaining_lease[i] <- year
  }
}

3.7 Locational Variable Preparation

3.7.1 Good Pri Sch Variable Preparation

We need to filter out good primary schools and save it as variable goodprischool for it to be used in the model. Good primary schools are defined to be Special Assistance Programme (SAP) primary schools or primary schools with the Gifted Education Programme (GEP).

gdprischool <- prischool %>%
  filter(sap_ind == "Yes" | gifted_ind == "Yes")

3.7.2 CBD Variable Preparation

The Central Business District (CBD) is in Downtown Core, located in the southwest part of Singapore. We will therefore take the coordinates of Downtown Core to be the coordinates of the CBD and convert it into a sf data frame.

lat <- 1.287953
long <- 103.851784

cbd_sf <- data.frame(lat, long) %>%
  st_as_sf(coords = c("long", "lat"), crs=4326) %>%
  st_transform(crs=3414)

3.7.3 Convert Datasets to sf

We need to convert the data sets into sf data frames for us to calculate the proximity distance matrices.

flat_resale_sf <- st_as_sf(flat_resale, 
                         coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

eldercare_sf <- st_as_sf(eldercare, 
                         coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

hawker_sf <- st_as_sf(hawker, 
                      coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

park_sf <- st_as_sf(park, 
                    coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

gdprischool_sf <- st_as_sf(gdprischool, 
                           coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

mall_sf <- st_as_sf(mall, 
                    coords = c("longitude", "latitude"), crs=4326) %>%
  st_transform(crs = 3414)

supermarket_sf <- st_as_sf(supermarket, 
                           coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

kindergarten_sf <- st_as_sf(kindergarten, 
                            coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

childcare_sf <- st_as_sf(childcare, 
                         coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

prischool_sf <- st_as_sf(prischool, 
                         coords = c("Lng", "Lat"), crs=4326) %>%
  st_transform(crs = 3414)

3.7.4 Proximity Distance Calculation

This function computes the distance to the nearest facility. st_distance() of sf package is used to compute the distance between all flats and the respective facilities. rowMins() of matrixStats is then used to take the shortest distance within each row in the output matrix. The values will be appended to the data frame as a new column.

proximity <- function(df1, df2, varname) {
  
  matrix <- st_distance(df1, df2) %>%
    drop_units()
  
  df1[,varname] <- rowMins(matrix)
  
  return(df1)
}

We will now implement the proximity function to the required variables.

flat_resale_sf <- proximity(flat_resale_sf, cbd_sf, "PROX_CBD") %>%
  proximity(., eldercare_sf, "PROX_EC") %>%
  proximity(., hawker_sf, "PROX_HA") %>%
  proximity(., mrt_sf, "PROX_MRT") %>%
  proximity(., park_sf, "PROX_PARK") %>%
  proximity(., gdprischool_sf, "PROX_GDPRI") %>%
  proximity(., mall_sf, "PROX_MALL") %>%
  proximity(., supermarket_sf, "PROX_SUP")

3.7.5 Facility Count within Radius Calculation

This function computes the facility count within a radius. Similarly, st_distance() of sf package is used to compute the distance between all flats and the respective facilities. rowSums() of dplyr is then used to count the observations with distances below the defined radius. The values will be appended to the data frame as a new column.

num_radius <- function(df1, df2, varname, radius) {
  
  matrix <- st_distance(df1, df2) %>%
    drop_units() %>%
    as.data.frame()
  
  df1[,varname] <- rowSums(matrix <= radius)
  
  return(df1)
}

We will now implement the num_radius function to the required variables.

flat_resale_sf <- num_radius(flat_resale_sf, kindergarten_sf, 
                             "NUM_KIN", 350) %>%
  num_radius(., childcare_sf, "NUM_CC", 350) %>%
  num_radius(., bus_sf, "NUM_STOP", 350) %>%
  num_radius(., prischool_sf, "NUM_PRI", 1000)

3.8 Saving the Dataset

Before saving the dataset, we will remove the redundant columns, rename the columns to shorter forms and relocate the price columns to the front of the data frame.

flat_resale_sf <- flat_resale_sf %>%
  select(5, 8, 9, 10:39) %>%
  rename("AREA_SQM" = "floor_area_sqm", "LEASE_YRS" = "remaining_lease", 
         "PRICE"= "resale_price") %>%
  relocate(`PRICE`)

We can now save the final flat resale price data set as a SHP file using st_write of sf package.

st_write(flat_resale_sf, "data/geospatial/resale-flat-prices-final.shp")

4.0 EDA

4.1 Import Full Dataset

st_read() of sf package is used to import the final data set, which is in shapefile format.

flat_resale_sf <- st_read(dsn="data/geospatial",
                          layer="resale-flat-prices-final")
Reading layer `resale-flat-prices-final' from data source 
  `C:\Users\user\Desktop\SMU\Y4S1\IS415\geniceee\IS415_blog\_posts\2021-11-05-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 15853 features and 32 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11597.31 ymin: 28217.39 xmax: 42623.63 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM

4.2 EDA using statistical graphs

4.2.1 Dependent Variable

We can plot the distribution of PRICE using a histogram as shown below.

ggplot(data=flat_resale_sf, aes(x=`PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

The figure above reveals a slightly right skewed distribution. This means that more HDB flats were transacted at relative lower prices.

Statistically, the skewed distribution can be normalised using log transformation. mutate() of dplyr package is used to derive a new variable called LOG_PRICE using a log transformation on the variable PRICE.

flat_resale_sf <- flat_resale_sf %>%
  mutate(`LOG_PRICE` = log(PRICE)) %>%
  relocate(`LOG_PRICE`, .after = `PRICE`)

We can now plot the distribution of LOG_PRICE.

ggplot(data=flat_resale_sf, aes(x=`LOG_PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

4.2.2 Independent Variables

It is found that the LEASE_YRS column is still in string format. We will convert it to numeric format for us to conduct EDA and to input into the model later.

flat_resale_sf$LEASE_YRS <- as.numeric(flat_resale_sf$LEASE_YRS)

We will now draw multiple histograms to show multiple distributions of the independent variables. This is done using ggarrange() of ggpubr package.

AREA_SQM <- ggplot(data=flat_resale_sf, aes(x= `AREA_SQM`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

LEASE_YRS <- ggplot(data=flat_resale_sf, aes(x= `LEASE_YRS`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_CBD <- ggplot(data=flat_resale_sf, aes(x= `PROX_CBD`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_EC <- ggplot(data=flat_resale_sf, aes(x= `PROX_EC`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_HAWKER <- ggplot(data=flat_resale_sf, aes(x= `PROX_HA`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MRT <- ggplot(data=flat_resale_sf, aes(x= `PROX_MRT`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_PARK <- ggplot(data=flat_resale_sf, aes(x= `PROX_PARK`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_GDPRI <- ggplot(data=flat_resale_sf, aes(x= `PROX_GDPRI`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MALL <- ggplot(data=flat_resale_sf, aes(x= `PROX_MALL`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_SUP <- ggplot(data=flat_resale_sf, aes(x= `PROX_SUP`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

NUM_KIN <- ggplot(data=flat_resale_sf, aes(x= `NUM_KIN`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

NUM_CC <- ggplot(data=flat_resale_sf, aes(x= `NUM_CC`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

NUM_STOP <- ggplot(data=flat_resale_sf, aes(x= `NUM_STOP`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

NUM_PRI <- ggplot(data=flat_resale_sf, aes(x= `NUM_PRI`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

ggarrange(AREA_SQM, LEASE_YRS, PROX_CBD, PROX_EC, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_GDPRI, PROX_MALL, PROX_SUP, NUM_KIN, NUM_CC, NUM_STOP, NUM_PRI, ncol = 3, nrow = 5)

We can observe that the distribution of the majority of the independent variables are right skewed, while variables such as LEASE_YRS and PROX_CBD are slightly left skewed.

4.3 EDA using statistical point map

We want to reveal the geospatial distribution flat resale prices (i.e. dependent variable) in Singapore using the tmap package.

tmap_mode("view")

tm_shape(mpsz_sf)+
  tm_polygons() +
tm_shape(flat_resale_sf) +  
  tm_dots(col = "PRICE",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))

From the map, we can observe that flats with higher resale prices are more concentrated in the central area of Singapore.